Demonstration of Plotting and Density Based Clustering Methods Using GNACT


Reducing dense geospatial datasets to the starts and stops of individual selectors challenges conventional GIS techniques. Geospatial Network Analysis with Clustering and Time (GNACT) is a Python package that helps parse and analyze geospatial trajectories.

In the following examples, we will use ship position data from AIS to track the MSC ARUSHI (MMSI: 636016432) during its voyages in 2017. We'll compare several density-based clustering methods, static trip segmentation based on known sites, and an implementation of dynamic trip segmentation from Scikit-Mobility. Using GNACT, we will plot the trajectories, calculate precision, recall, and the F1 measure against a "gold-standard" ground truth, and plot clusters in different ways to help an analyst understand how a particular clustering methodology or set of hyperparameters is performing.

Plot of Raw Data

Use the World Port Index as Reference Sites

This list is not exhaustive, but it is a good example of a real-world reference dataset where most of the major sites are known but many smaller sites are not.

Finding Stops/Clusters at Known Ports Using Static Trip Segmentation

We will generate clusters for all positions that spend a minimum amount of time within a certain distance of any known site. This is known as static trip segmentation in the literature. It should have the lowest false positive rate of the methods compared here, because every clustered position must be within a certain distance of a known port.

I previously developed a static trip segmentation methodology during Directed Studies on Network Analysis. By applying this methodology to a geospatial dataset with a known set of ports, I generated a network map with each stop representing a node and travel between ports representing edges.
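As a rough illustration of the stops-to-network idea, the sketch below turns a time-ordered list of stops into weighted, directed edges. The function name `stops_to_edges` and the port names are hypothetical and are not part of GNACT.

```python
from collections import Counter

def stops_to_edges(stop_sequence):
    """Count directed port-to-port trips from a time-ordered list of stops."""
    edges = Counter()
    # Pair each stop with the one that follows it.
    for origin, destination in zip(stop_sequence, stop_sequence[1:]):
        if origin != destination:  # ignore repeated stops at the same port
            edges[(origin, destination)] += 1
    return edges

# Hypothetical visit sequence, for illustration only.
visits = ["SAVANNAH", "CHARLESTON", "NORFOLK", "SAVANNAH", "CHARLESTON"]
edges = stops_to_edges(visits)
```

Each key in `edges` is an (origin, destination) pair and each value is the number of trips observed, which could serve as an edge weight in a graph library.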

Calculate Nearest Site for Each Position and Create a List of Stops

The first step in static trip segmentation is to calculate the nearest known site for each position. Here we will apply the approach against a DataFrame using a custom function.
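The nearest-site step might look something like the following brute-force sketch, which uses the haversine formula. The function names and site coordinates here are assumptions for illustration and do not reflect GNACT's actual implementation; for a large dataset, a spatial index would likely be faster than comparing every position to every site.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def nearest_site(lat, lon, sites):
    """Return (site_name, distance_km) for the site closest to a position.

    `sites` maps a site name to a (lat, lon) tuple.
    """
    return min(
        ((name, haversine_km(lat, lon, s_lat, s_lon))
         for name, (s_lat, s_lon) in sites.items()),
        key=lambda pair: pair[1],
    )

# Hypothetical sites with approximate coordinates, for illustration only.
sites = {"SAVANNAH": (32.08, -81.09), "CHARLESTON": (32.78, -79.93)}
name, dist_km = nearest_site(32.10, -81.10, sites)
```

The same function could be applied row by row to a DataFrame of positions to add a nearest-site column and a distance column.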

We will use a distance threshold of 5 km and a loiter time of 6 hours (360 minutes). This means a cargo ship must spend at least 6 hours within 5 km of a known port to be counted as making a stop.
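A minimal sketch of how these two thresholds could be applied is shown below. It assumes each position already carries its nearest site and the distance to it, and it measures loiter time as the span between the first and last position of a run; both the function name `find_stops` and that measurement choice are assumptions rather than GNACT's confirmed behavior.

```python
from datetime import datetime, timedelta

def find_stops(records, dist_threshold_km=5.0, loiter_minutes=360):
    """Group consecutive in-range positions into stops.

    `records` is a time-ordered list of (timestamp, nearest_site, dist_km).
    A stop is a run of consecutive positions within the distance threshold
    of the same site whose first-to-last time span meets the loiter time.
    Returns a list of (site, arrival, departure).
    """
    stops, run = [], []

    def close_run():
        # Keep the run only if the ship loitered long enough.
        if run and run[-1][0] - run[0][0] >= timedelta(minutes=loiter_minutes):
            stops.append((run[0][1], run[0][0], run[-1][0]))
        run.clear()

    for ts, site, dist_km in records:
        in_range = dist_km <= dist_threshold_km
        # A run ends when the ship leaves range or the nearest site changes.
        if run and (not in_range or site != run[0][1]):
            close_run()
        if in_range:
            run.append((ts, site))
    close_run()
    return stops

# Hypothetical hourly positions: 7 hours near one port, then a short visit elsewhere.
t0 = datetime(2017, 1, 1)
records = [(t0 + timedelta(hours=h), "SAVANNAH", 2.0) for h in range(8)]
records += [(t0 + timedelta(hours=8 + h), "SAVANNAH", 50.0) for h in range(3)]
records += [(t0 + timedelta(hours=11 + h), "CHARLESTON", 1.0) for h in range(3)]
stops = find_stops(records)
```

In this toy input only the first run qualifies, because the second visit spans 2 hours and falls short of the 6-hour loiter time.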

Building "Ground Truth" From Static Trip Segmentation Results

Now we can apply static trip segmentation against this data and use it as our "ground truth".

After plotting the ports visited, it is clear that there was activity near Savannah that was not recorded. It turns out the port near Savannah where the MSC ARUSHI stopped was not in the database. We can add the port manually, regenerate our nearest_neighbor df, and recompute our statistics.

Review "Ground Truth"

Now our new Savannah port is correctly identified as a ground truth cluster. We can next use our code to generate clusters and compare them to the ground truth.
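One simple way to score a clustering result against the ground truth is at the position level, treating "assigned to any cluster" as the positive class. The sketch below takes that approach. It is an assumption about how the comparison could be done, not a description of GNACT's scoring, and the label lists are made up.

```python
def precision_recall_f1(truth_labels, predicted_labels, noise=-1):
    """Position-level precision, recall, and F1 for clustered vs. noise.

    Both inputs are per-position cluster labels in the same order, with
    `noise` marking positions that belong to no cluster.
    """
    pairs = list(zip(truth_labels, predicted_labels))
    tp = sum(t != noise and p != noise for t, p in pairs)  # clustered in both
    fp = sum(t == noise and p != noise for t, p in pairs)  # clustered only in prediction
    fn = sum(t != noise and p == noise for t, p in pairs)  # clustered only in truth
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical labels for eight positions, for illustration only.
truth = [0, 0, 0, -1, -1, 1, 1, -1]
predicted = [0, 0, -1, -1, 2, 1, 1, -1]
precision, recall, f1 = precision_recall_f1(truth, predicted)
```

Note that this scoring ignores which cluster a position lands in. It only asks whether a position was clustered at all, so a stricter comparison would also need to match predicted clusters to ground truth clusters.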

Apply Algorithms, Get Clusters, Compare to Ground Truth, and Plot Results

Plot of DBSCAN with Low Parameters

Plot of DBSCAN with High Parameters

DBSCAN with Tuned Parameters

DBSCAN With Speed Filter

OPTICS with Tuned Parameters

ST_DBSCAN with Tuned Parameters

Experimentation with Double Clustering Approaches

By clustering raw positions to centerpoints and then clustering those centerpoints, we can use lower thresholds when clustering each ship's positions. We can then cluster the resulting centerpoints with a minimum sample size above a certain noise threshold.
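The two-stage idea can be sketched with a small, self-contained DBSCAN. Everything below is illustrative: the function names, thresholds, and coordinates are assumptions, the DBSCAN is a simple O(n^2) version rather than a library implementation, and the centerpoint is a plain mean of latitude and longitude, which is only reasonable for compact clusters away from the antimeridian.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) tuples."""
    phi1, phi2 = radians(p[0]), radians(q[0])
    dlam = radians(q[1] - p[1])
    a = sin((phi2 - phi1) / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * 6371.0088 * asin(sqrt(a))

def dbscan(points, eps_km, min_samples):
    """Minimal O(n^2) DBSCAN; returns one label per point, -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # Includes the point itself, as in common DBSCAN implementations.
        return [j for j, q in enumerate(points)
                if haversine_km(points[i], q) <= eps_km]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but does not expand
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_samples:  # core point: keep expanding
                queue.extend(j_nbrs)
    return labels

def centerpoints(points, labels):
    """Mean (lat, lon) of each non-noise cluster."""
    groups = {}
    for p, lab in zip(points, labels):
        if lab != -1:
            groups.setdefault(lab, []).append(p)
    return [(sum(lat for lat, _ in g) / len(g), sum(lon for _, lon in g) / len(g))
            for g in groups.values()]

def double_cluster(tracks, eps1_km, min1, eps2_km, min2):
    """Stage 1: cluster each ship's positions. Stage 2: cluster the centerpoints."""
    centers = []
    for positions in tracks.values():
        centers.extend(centerpoints(positions, dbscan(positions, eps1_km, min1)))
    return centers, dbscan(centers, eps2_km, min2)

# Hypothetical positions for two ships, for illustration only.
tracks = {
    "ship_a": [(32.080, -81.090), (32.081, -81.091), (32.079, -81.089), (40.0, -70.0)],
    "ship_b": [(32.082, -81.092), (32.083, -81.090), (32.081, -81.093),
               (36.900, -76.300), (36.901, -76.301), (36.899, -76.299)],
}
centers, labels = double_cluster(tracks, eps1_km=0.5, min1=3, eps2_km=1.0, min2=2)
```

In this toy input, both ships produce a centerpoint at the first site, so those two centerpoints form a second-stage cluster, while the site visited by only one ship falls below the second-stage minimum sample size and is labeled noise.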

Integration of Scikit-Mobility and Dynamic Segmentation of Trips

Scikit-Mobility provides additional plotting and packaging tools to parse a geospatial dataset into "trips" based on "stops" along each UID's path.

Scikit-Mobility (SKMOB) has a "stop detection" algorithm that identifies stops based on distance, duration, and max speed, and includes an escape clause for when there is a gap in the data.
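To make those criteria concrete, here is a simplified, standalone sketch of a stop detector that uses a radius, a minimum duration, a speed cap, and a data-gap cutoff. It is not Scikit-Mobility's implementation or API; the function name `detect_stops`, its parameters, and the default values are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dlam = radians(lon2 - lon1)
    a = sin((phi2 - phi1) / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * 6371.0088 * asin(sqrt(a))

def segment_speed_kmh(p, q):
    """Average speed between two (timestamp, lat, lon) points."""
    hours = (q[0] - p[0]).total_seconds() / 3600
    if hours <= 0:
        return 0.0  # guard against duplicate timestamps
    return haversine_km(p[1], p[2], q[1], q[2]) / hours

def detect_stops(track, radius_km=0.2, min_minutes=20,
                 max_gap_minutes=720, max_speed_kmh=float("inf")):
    """Find stops in a time-ordered list of (timestamp, lat, lon).

    Returns a list of (mean_lat, mean_lon, arrival, departure).
    """
    stops, i, n = [], 0, len(track)
    while i < n:
        anchor, j = track[i], i + 1
        # Extend the window while points stay near the anchor, move slowly,
        # and arrive without a long gap in the data (the "escape clause").
        while (j < n
               and haversine_km(anchor[1], anchor[2], track[j][1], track[j][2]) <= radius_km
               and track[j][0] - track[j - 1][0] <= timedelta(minutes=max_gap_minutes)
               and segment_speed_kmh(track[j - 1], track[j]) <= max_speed_kmh):
            j += 1
        window = track[i:j]
        if window[-1][0] - window[0][0] >= timedelta(minutes=min_minutes):
            lat = sum(p[1] for p in window) / len(window)
            lon = sum(p[2] for p in window) / len(window)
            stops.append((lat, lon, window[0][0], window[-1][0]))
            i = j  # resume after the stop
        else:
            i += 1
    return stops

# Hypothetical track: 40 minutes stationary, then moving north.
t0 = datetime(2017, 1, 1)
track = [(t0 + timedelta(minutes=10 * k), 32.08, -81.09) for k in range(5)]
track += [(t0 + timedelta(minutes=50 + 10 * k), 32.09 + 0.01 * k, -81.09) for k in range(3)]
stops = detect_stops(track)
```

Unlike static trip segmentation, this kind of detector needs no list of known sites, which is why it can surface stops at locations missing from a reference database.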

We can then use DBSCAN to cluster the stops and find frequently visited "destinations".

Now we can add it to our existing function wrappers and compute the statistics.